PART A

• DOMAIN: Automobile

• CONTEXT: The data concerns city-cycle fuel consumption in miles per gallon to be predicted in terms of 3 multivalued discrete and 5 continuous attributes.

• PROJECT OBJECTIVE: To understand K-means Clustering by applying it to the Car dataset to segment the cars into various categories.

1. Data Understanding & Exploration:

A. Read ‘Car name.csv’ as a DataFrame and assign it to a variable.

* Import all the necessary libraries.
THERE ARE A TOTAL OF 398 RECORDS IN THE CSV FILE, WHICH ARE CAR NAMES, AND THE DATA IS LOADED INTO THE CAR_NAME DATAFRAME

B. Read ‘Car-Attributes.json’ as a DataFrame and assign it to a variable.

THERE ARE A TOTAL OF 398 RECORDS IN THE DATA FRAME CAR_DATA

C. Merge both the DataFrames together to form a single DataFrame

THE MERGED DATA FRAME CARS CONTAINS 398 ROWS OF CAR MODEL NAMES AND RELATED DATA.
FROM THE ABOVE INFORMATION,
    * THERE ARE 398 FIELDS IN ALL THE COLUMNS.
    * THERE ARE NO NULL FIELDS IN THE DATA.
    * THE HORSEPOWER COLUMN CONTAINS INTEGER DATA BUT IS REPRESENTED AS THE OBJECT DATA TYPE.
* THERE ARE 6 ROWS THAT CONTAIN "?" IN THE HORSEPOWER COLUMN

* LET US DROP ALL THE ROWS THAT CONTAIN "?" BEFORE FURTHER ANALYSIS
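The read-and-merge steps above can be sketched as follows. The inline frames are stand-ins for ‘Car name.csv’ and ‘Car-Attributes.json’ (read with pd.read_csv / pd.read_json in the actual notebook), so the row values here are illustrative.

```python
import pandas as pd

# In the notebook these come from pd.read_csv('Car name.csv') and
# pd.read_json('Car-Attributes.json'); tiny inline frames stand in here.
car_name = pd.DataFrame({"car_name": ["chevrolet chevelle", "buick skylark", "plymouth satellite"]})
car_data = pd.DataFrame({"mpg": [18.0, 15.0, 18.0], "cyl": [8, 8, 8], "disp": [307.0, 350.0, 318.0]})

# The two frames share the same row order, so a positional join on the
# index merges them into a single DataFrame.
cars = pd.concat([car_name, car_data], axis=1)
print(cars.shape)  # one row per car, names alongside attributes
```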

D. Print 5 point summary of the numerical features and share insights.

* THERE ARE NO NEGATIVE VALUES IN THE COLUMNS
* THE DISP AND WT COLUMNS ARE MORE RIGHT-SKEWED
* THERE IS A PRESENCE OF OUTLIERS IN THE DISP AND WT COLUMNS
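The 5-point summary comes straight from pandas describe(); the disp/wt values below are made-up stand-ins for the real columns, used only to show the mechanics.

```python
import pandas as pd

# Toy numeric frame; in the notebook this would be cars.describe().
df = pd.DataFrame({"disp": [97.0, 110.0, 151.0, 307.0, 455.0],
                   "wt": [2130.0, 2234.0, 2551.0, 3504.0, 4951.0]})

# describe() gives the 5-point summary (min, 25%, 50%, 75%, max) plus count/mean/std.
summary = df.describe()
print(summary.loc[["min", "25%", "50%", "75%", "max"]])

# Mean above median is one quick hint of the right skew noted for disp and wt.
skewed_right = (df.mean() > df.median()).all()
```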

2. Data Preparation & Analysis:

A. Check and print feature-wise percentage of missing values present in the data and impute with the best suitable approach.

* THERE ARE NO NULL / MISSING VALUES IN THE CARS DATA FRAME

B. Check for duplicate values in the data and impute with the best suitable approach.

* THERE ARE NO DUPLICATE VALUES PRESENT IN THE CARS DATA FRAME
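A minimal sketch of the two checks above, run on a toy frame that deliberately contains one missing value and one duplicated row (the real cars frame has neither):

```python
import numpy as np
import pandas as pd

# Toy frame with one NaN and one duplicate row to illustrate the checks.
df = pd.DataFrame({"mpg": [18.0, 15.0, 15.0, np.nan], "cyl": [8, 8, 8, 4]})

# Feature-wise percentage of missing values.
missing_pct = df.isnull().mean() * 100
print(missing_pct)

# Count fully duplicated rows; drop them if any are found.
n_dupes = df.duplicated().sum()
df = df.drop_duplicates()
```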

C. Plot a pairplot for all features.

* WE USED HUE ON ORIGIN AND PLOTTED A PAIR PLOT FOR ALL THE COLUMNS IN THE CARS DATA

D. Visualize a scatterplot for ‘wt’ and ‘disp’. Datapoints should be distinguishable by ‘cyl’.

E. Share insights for Q2.d.

* WE HAVE DRAWN A SCATTER PLOT OF WEIGHT AGAINST DISPLACEMENT FROM THE CARS DATA.

* WE USED THE CYLINDER COLUMN DATA TO DISTINGUISH THE DATA POINTS.

* CYLINDER VALUES 4, 6 AND 8 HAVE MORE DATA POINTS, WHILE 3 AND 5 HAVE FEWER

F. Visualize a scatterplot for ‘wt’ and ’mpg’. Datapoints should be distinguishable by ‘cyl’.

G. Share insights for Q2.f.

* WE HAVE DRAWN A SCATTER PLOT OF WEIGHT AGAINST MPG WITH CYLINDER AS HUE

* CYLINDER VALUES 4, 6 AND 8 HAVE THE MOST DATA POINTS COMPARED TO CYLINDER VALUES 3 AND 5

H. Check for unexpected values in all the features and datapoints with such values.

* WE HAVE PRINTED THE UNIQUE VALUES IN ALL THE COLUMNS OF THE CAR DATA.

* THE COLUMN HP HAS "?" IN SOME OF THE ROWS.
* THERE ARE 6 ROWS IN THE CARS DATA FRAME THAT CONTAIN THE VALUE "?"

* WE WILL DROP ALL 6 ROWS THAT CONTAIN THE SPECIAL CHARACTER "?"
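The "?" clean-up described above can be done with pd.to_numeric; the four-row frame here is a stand-in for the real hp column:

```python
import pandas as pd

# 'hp' arrives as object dtype because a few rows hold the placeholder '?'.
cars = pd.DataFrame({"hp": ["130", "165", "?", "150"], "mpg": [18.0, 15.0, 25.0, 16.0]})

# Coerce the non-numeric '?' entries to NaN, then drop those rows,
# mirroring the decision above to discard the affected records.
cars["hp"] = pd.to_numeric(cars["hp"], errors="coerce")
cars = cars.dropna(subset=["hp"])
print(len(cars))  # 3 rows survive
```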
* THE BELOW PAIRS ARE HIGHLY CORRELATED:
    . CYLINDER AND DISPLACEMENT
    . WEIGHT AND DISPLACEMENT

* THE BELOW PAIRS ARE NEGATIVELY (INVERSELY) CORRELATED:
    . MPG AND CYLINDER
    . DISPLACEMENT AND MPG
    . WEIGHT AND MPG
    . HORSEPOWER AND MPG
* MODEL YEAR HAS A MIN VALUE OF 70 AND A MAX VALUE OF 82.
* WE ASSUME THAT THE SEGMENTATION PROCESS IS BEING CARRIED OUT IN THE YEAR 83.
* WITH THIS ASSUMPTION, WE CAN CALCULATE THE AGE OF THE VEHICLE WITH THE FORMULA 83 - MODEL YEAR
* ONCE THE AGE IS CALCULATED, WE CAN DROP THE YEAR COLUMN.
* AGE IN THIS SCENARIO REPRESENTS HOW MANY YEARS THE CAR HAS BEEN IN USE
* ORIGIN HAS THREE VALUES: 1, 2, 3. CARS BELONGING TO ORIGIN 1 ARE MORE THAN 2 AND 3 COMBINED.
* WE CAN CREATE 3 DUMMY VARIABLES FOR 1, 2, 3 AND NAME THEM ORIGIN_1, ORIGIN_2, ORIGIN_3
* THIS HELPS THE MODEL IN UNDERSTANDING THE TRAINING DATA AND CAN BE RESCALED EASILY.
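A sketch of the two derived features proposed above: age under the year-83 assumption, and one indicator column per origin value. The column names yr/origin are assumed to match the notebook.

```python
import pandas as pd

# Three illustrative rows; yr and origin are assumed column names.
cars = pd.DataFrame({"yr": [70, 76, 82], "origin": [1, 3, 2]})

# Age under the assumption that segmentation happens in year 83.
cars["age"] = 83 - cars["yr"]

# One indicator column per origin value, then drop the raw year column.
cars = pd.get_dummies(cars, columns=["origin"], prefix="origin")
cars = cars.drop(columns=["yr"])
print(sorted(cars.columns))
```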
* THERE ARE OUTLIERS IN THE HORSEPOWER, ACCELERATION AND MPG COLUMNS
* DEALING WITH OUTLIERS CAN HELP US TRAIN THE MODEL BETTER.
* WE CAN EITHER REMOVE THE OUTLIERS OR TRANSFORM THEM TO FIT THE DATA
* SINCE THE NUMBER OF OUTLIERS IS VERY LOW, LET US FIT THEM INTO OUR DATA.
* WE CAN USE A LOG TRANSFORM, WINSORIZATION OR ANY ROBUST SCALER METHOD.
* I WILL PROCEED WITH WINSORIZATION TO DEAL WITH THE OUTLIERS.
* AFTER TRANSFORMING THE OUTLIERS, WE CAN SEE THAT THERE ARE NO OUTLIERS IN THE NEW BOX PLOTS ABOVE.
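One way to winsorize is scipy.stats.mstats.winsorize; clipping at chosen percentiles with NumPy does the same job and is shown here on a hypothetical hp series with one extreme value (the 5%/95% limits are an assumption, not the notebook's actual setting):

```python
import numpy as np
import pandas as pd

# Hypothetical series with one extreme value (230).
hp = pd.Series([60, 70, 75, 80, 85, 90, 95, 100, 110, 230], dtype=float)

# Clip at the 5th and 95th percentiles, which is what winsorizing at
# limits=0.05 amounts to: extremes are pulled in, not removed.
lo, hi = np.percentile(hp, [5, 95])
hp_wins = hp.clip(lower=lo, upper=hi)
assert hp_wins.max() <= hi  # the 230 has been pulled down to the cap
```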
* AFTER DROPPING THE COLUMNS THAT DO NOT HAVE MUCH SIGNIFICANCE AND TRANSFORMING THE OUTLIERS,
    WE HAVE DRAWN A CORRELATION MATRIX, PAIR PLOT AND HEAT MAP FOR THE NEW DATA SET FORMED.
* SINCE WE HAVE DEALT WITH THE OUTLIERS AND OTHER FEATURES, WE HAVE OBTAINED THE FINAL DATA FRAME.

* NOW WE WILL PROCEED TO SCALE THE DATA.
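Scaling to z-scores (StandardScaler) on a tiny made-up matrix; after scaling each column has mean 0 and standard deviation 1, so no single feature dominates the Euclidean distances K-means relies on:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up (mpg, weight) rows; weight would otherwise swamp mpg in distance terms.
X = np.array([[18.0, 3500.0], [30.0, 2100.0], [24.0, 2800.0]])
X_scaled = StandardScaler().fit_transform(X)

print(X_scaled.mean(axis=0).round(6))  # ~[0. 0.]
print(X_scaled.std(axis=0).round(6))   # ~[1. 1.]
```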

3. Clustering:

A. Apply K-Means clustering for 2 to 10 clusters.

B. Plot a visual and find elbow point.

* FROM THE ABOVE PLOT, WE CAN CONSIDER OUR ELBOW POINT TO BE AT 4.

C. On the above visual, highlight which are the possible Elbow points.

* FROM THE ABOVE VISUAL, WE CAN SEE THAT THERE IS A SHARP CURVE AT 4
* 3 AND 7 CAN ALSO BE CONSIDERED ELBOW POINTS, BUT 4 SEEMS TO BE THE OPTIMUM POINT.
* HENCE WE CAN CONSIDER THE NUMBER OF CLUSTERS AS 4.
* LET US RUN K-MEANS WITH 4 CLUSTERS.
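A sketch of the elbow search over k = 2..10 and the refit at k = 4. Four synthetic blobs stand in for the scaled cars data, so the elbow genuinely sits at 4; the real curve is computed the same way from the fitted inertia values.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled cars data: four well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(40, 2))
               for c in ([0, 0], [4, 0], [0, 4], [4, 4])])

# Inertia (within-cluster sum of squares) for k = 2..10; plotting these
# against k gives the elbow curve.
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in range(2, 11)]

# Inertia keeps dropping as k grows; the elbow is where the drop flattens.
# Refit at the chosen elbow point k = 4.
km4 = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)
print(len(set(km4.labels_)))  # 4 clusters
```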

D. Train a K-means clustering model once again on the optimal number of clusters.

* WE HAVE CREATED 4 CLUSTERS ON THE CARS DATA CONSIDERING THE ELBOW POINT AT 4.
* GROUPS 0, 1, 2, 3 ARE THE 4 CLUSTERS THAT ARE FORMED.
* CLUSTER 1 HAS THE MAX MPG VALUE. NEWER CARS TEND TO HAVE MORE MILEAGE COMPARED TO OLD CARS.
* CLUSTER 0 HAS THE LOWEST MPG VALUE, SUGGESTING THE CLUSTER CONTAINS OLDER CARS
* CLUSTER 0 HAS THE HIGHEST CYLINDER COUNT WHEREAS CLUSTER 2 HAS THE LOWEST
* CLUSTER 0 HAS THE HIGHEST DISPLACEMENT WHEREAS CLUSTER 2 HAS THE LOWEST DISPLACEMENT
* CLUSTER 0 HAS THE HIGHEST HORSEPOWER WHEREAS CLUSTER 1 HAS THE LOWEST
* CLUSTER 0 HAS THE HIGHEST WEIGHT, WHICH IS EXPECTED CONSIDERING THESE CARS CONTAIN MORE CYLINDERS, WHICH IS PROPORTIONAL TO WEIGHT
* CLUSTER 0 HAS LOW ACCELERATION, AS THESE CARS HAVE MORE WEIGHT AND MORE CYLINDERS, WHICH REDUCES ACCELERATION
* CLUSTER 1 HAS THE HIGHEST ACCELERATION, CONSIDERING THESE CARS HAVE FEWER CYLINDERS AND LESS WEIGHT
* CLUSTER 0 HAS MOST OF ITS CARS FROM ORIGIN 1
* CLUSTER 1 HAS MOST OF ITS CARS FROM ORIGINS 2 AND 3 COMBINED, AND THEN FROM 1

E. Add a new feature in the DataFrame which will have labels based upon cluster value.

* IN THIS STEP, WE WILL ADD THE CLUSTER LABELS TO THE CARS DATA FRAME.

F. Plot a visual and color the datapoints based upon clusters.

G. Pass a new DataPoint and predict which cluster it belongs to.

* WE ARE CREATING A DATAFRAME THAT CONTAINS THE DIFFERENT DATA POINTS WE WANT TO TEST THE MODEL WITH.
* INSTEAD OF A SINGLE DATA POINT, WE WILL CREATE A DATAFRAME OF 6 NEW DATA POINTS AND PREDICT THEM AGAINST THE MODEL.
* WE WILL CHECK WHETHER THE TRAINED MODEL CLUSTERS THE DATA INTO THE 4 GROUPS THAT WERE CREATED EARLIER.
* 3 OF THE CARS FITTED INTO CLUSTER 0, AND EACH OF THE REMAINING CARS FITTED INTO CLUSTERS 1, 2 AND 3
* WE CAN SEE THAT OUR MODEL IS ABLE TO PREDICT AND GROUP THE NEW DATA POINTS PASSED FOR GROUPING
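Prediction for new points can be sketched like this; both the training blobs and the six test points are made up, and only 2 features are used for brevity (the real new points would carry the full scaled feature set):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Stand-in for the scaled training data: four separated blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(40, 2))
               for c in ([0, 0], [4, 0], [0, 4], [4, 4])])
km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

# Six hypothetical new data points framed as a DataFrame, as in the text.
new_points = pd.DataFrame([[0.1, 0.2], [3.9, 0.1], [0.2, 4.1],
                           [4.0, 3.8], [0.0, 0.0], [4.2, 4.1]])
labels = km.predict(new_points.to_numpy())
print(labels)  # each new point is assigned to one of the 4 learned clusters
```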

PART B

DOMAIN: Automobile

• CONTEXT: The purpose is to classify a given silhouette as one of three types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.

• DATA DESCRIPTION: The data contains features extracted from the silhouette of vehicles in different angles. Four "Corgie" model vehicles were used for the experiment: a double decker bus, a Chevrolet van, a Saab 9000 and an Opel Manta 400. This particular combination of vehicles was chosen with the expectation that the bus, van and either one of the cars would be readily distinguishable, but it would be more difficult to distinguish between the cars.

• All the features are numeric i.e. geometric features extracted from the silhouette.

• PROJECT OBJECTIVE: Apply dimensionality reduction technique – PCA and train a model and compare relative results.

1. Data Understanding & Cleaning:

A. Read ‘vehicle.csv’ and save as DataFrame.

* EXCEPT FOR THE CLASS COLUMN, ALL THE REMAINING COLUMNS ARE NUMERICAL DATA.
* THE DATA FRAME CONTAINS 846 ROWS AND 19 COLUMNS
* THERE ARE NULL VALUES PRESENT IN SOME OF THE COLUMNS
* GOING FURTHER, WE NEED TO DEAL WITH THE NULL VALUES TO PERFORM TESTING ON THE DATA.
* THERE ARE 3 CLASSES IN TOTAL: CAR, BUS AND VAN.
* THERE ARE 429 CARS, 218 BUSES AND 199 VANS

B. Check percentage of missing values and impute with correct approach.

* HERE WE WILL CHECK FOR THE PRESENCE OF NULL VALUES IN THE DATA.
* WE CAN EITHER REPLACE THE NULL VALUES WITH THE MEDIAN OR DROP ALL THE ROWS WITH NULL VALUES.
* IN THIS SCENARIO, WE WILL PROCEED WITH REPLACING THE NULL VALUES IN THE COLUMNS WITH THE MEDIAN.
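Median imputation in pandas can be sketched as below; the column names and values are invented for illustration. The median is preferred over the mean here because it is robust to the outliers noted later.

```python
import numpy as np
import pandas as pd

# Toy frame with one NaN per column (column names are hypothetical).
df = pd.DataFrame({"circularity": [40.0, np.nan, 44.0, 50.0],
                   "compactness": [92.0, 95.0, np.nan, 100.0]})

# Fill each column's NaNs with that column's median.
df = df.fillna(df.median(numeric_only=True))
assert df.isnull().sum().sum() == 0  # no missing values remain
```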

C. Visualize a Pie-chart and print percentage of values for variable ‘class’.

* THE CLASS COLUMN CONTAINS THREE TYPES OF VEHICLES: VAN, BUS, CAR
* OUT OF THE THREE CLASSES, CAR HAS THE MOST ROWS AND VAN HAS THE FEWEST.
* THE NUMBER OF CARS IS MORE THAN VANS AND BUSES COMBINED.
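The class percentages behind the pie chart can be computed with value_counts(normalize=True); the counts below are the ones quoted above (429/218/199). In the notebook the same series would feed `.plot.pie(autopct="%.1f%%")`.

```python
import pandas as pd

# Class counts from the text: 429 cars, 218 buses, 199 vans (846 total).
cls = pd.Series(["car"] * 429 + ["bus"] * 218 + ["van"] * 199)

# Percentage of each class, rounded to one decimal.
pct = (cls.value_counts(normalize=True) * 100).round(1)
print(pct)
```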

D. Check for duplicate rows in the data and impute with correct approach.

* WE ARE GOING TO CHECK FOR DUPLICATE ROWS AND EITHER DROP THEM OR IMPUTE THEM.
* THERE ARE NO DUPLICATE VALUES PRESENT IN THE DATA SET.
* THERE ARE OUTLIERS PRESENT IN THE DATA
* WE WILL PROCEED WITH TREATING THE OUTLIERS USING A SUITABLE APPROACH.
* BELOW ARE THE COLUMNS WITH OUTLIER DATA:
* 'radius_ratio', 'pr.axis_aspect_ratio', 'max.length_aspect_ratio', 'scaled_variance',
  'scaled_variance.1', 'scaled_radius_of_gyration.1', 'skewness_about', 'skewness_about.1'
* WE CAN SEE THAT THE OUTLIERS ARE NOW FULLY DEALT WITH.
* THROUGH LABEL ENCODING, WE HAVE CONVERTED THE CATEGORICAL VARIABLE.
* 1 REPRESENTS CAR, 0 REPRESENTS BUS AND 2 REPRESENTS VAN.
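A sketch of the label encoding step. sklearn's LabelEncoder assigns codes alphabetically, which reproduces the bus = 0, car = 1, van = 2 mapping quoted above.

```python
from sklearn.preprocessing import LabelEncoder

# A few illustrative class values.
classes = ["car", "bus", "van", "car", "bus"]
le = LabelEncoder()
encoded = le.fit_transform(classes)

# Alphabetical order of le.classes_ determines the codes: bus->0, car->1, van->2.
print(dict(zip(le.classes_, range(len(le.classes_)))))
```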

2. Data Preparation:

A. Split data into X and Y. [Train and Test optional]

B. Standardize the Data.

* FROM THE ABOVE PLOTS, WE CAN SEE THAT AFTER STANDARDIZING, THE VARIABLES ARE BROUGHT DOWN TO A COMMON SCALE.

3. Model Building:

A. Train a base Classification model using SVM.

* WE HAVE RUN THE SVM MODEL USING LINEAR, POLY AND RBF KERNELS
* THE SVM_LINEAR MODEL PERFORMS BEST OF THE THREE.
* WE WILL PROCEED WITH THE LINEAR MODEL AND PRINT THE SCORES FOR THE TRAIN AND TEST DATA
* WE CAN SEE THAT THE TRAIN DATA ACCURACY IS ABOVE 95% AND THE TEST DATA ACCURACY IS ABOVE 93%
* THE SVM LINEAR MODEL FITS THE DATA BEST.
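The kernel comparison can be sketched as below on synthetic 3-class data; make_classification is a stand-in for the vehicle features, so the scores here will not match the 95%/93% figures above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the standardized vehicle features (3 classes).
X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
scaler = StandardScaler().fit(X_train)           # fit scaler on train only
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# One SVC per kernel; compare held-out accuracy.
scores = {}
for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel, random_state=0).fit(X_train, y_train)
    scores[kernel] = clf.score(X_test, y_test)
print(scores)
```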

B. Print Classification metrics for train data.

C. Apply PCA on the data with 10 components.

D. Visualize Cumulative Variance Explained with Number of Components.

E. Draw a horizontal line on the above plot to highlight the threshold of 90%.

* IN THE ABOVE PLOT, A DASHED RED LINE IS DRAWN ACROSS THE PLOT TO REPRESENT THE THRESHOLD OF 90%

* AT THE POINT WHERE THE LINE CUTS THE CUMULATIVE VARIANCE CURVE, WE CAN SEE THAT AROUND 5 TO 6 PRINCIPAL
    COMPONENTS ARE NEEDED TO REACH THE 90% THRESHOLD.
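A sketch of the cumulative-explained-variance computation behind the plot. The synthetic matrix concentrates variance in a few components, as the standardized vehicle data does; in the notebook the horizontal `axhline(0.90)` would be drawn on `cum_var`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 10 features driven by 4 latent factors, so most of
# the variance lands in the first few principal components.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 4))
X = latent @ rng.normal(size=(4, 10)) + 0.05 * rng.normal(size=(200, 10))
X = StandardScaler().fit_transform(X)

pca = PCA(n_components=10).fit(X)
cum_var = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 90%,
# i.e. where the curve crosses the dashed threshold line.
n_90 = int(np.argmax(cum_var >= 0.90)) + 1
print(n_90, cum_var.round(3))
```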

F. Apply PCA on the data. This time Select Minimum Components with 90% or above variance explained.

* FROM THE ABOVE PLOT, WE CAN SEE THAT THE 90% THRESHOLD IS CROSSED AROUND 5 TO 6 COMPONENTS

* WE SELECT 6 COMPONENTS, WHERE THE CUMULATIVE EXPLAINED VARIANCE IS ABOVE 90%
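Passing a float to n_components asks scikit-learn to pick the minimum number of components reaching that variance fraction, which matches the selection rule above (the data here is synthetic):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic rank-3 data with small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 8)) + 0.05 * rng.normal(size=(200, 8))

# A float in (0, 1) means: keep the minimum number of components whose
# cumulative explained variance reaches that fraction.
pca = PCA(n_components=0.90).fit(X)
X_reduced = pca.transform(X)
print(X_reduced.shape[1], pca.explained_variance_ratio_.sum())
```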

G. Train SVM model on components selected from above step.

H. Print Classification metrics for train data of above model and share insights.

* WE HAVE SELECTED ONLY 6 COMPONENTS, WHICH RETAIN ABOUT 90% OF THE VARIANCE AVAILABLE TO THE EARLIER MODEL
* ON REDUCING THE COMPONENTS, WE CAN SEE THAT ACCURACY, PRECISION, RECALL AND F1-SCORE ARE REDUCED.
* STILL, THE MODEL HAS AN ACCURACY OF 83.7% ON TRAIN DATA AND 80.3% ON TEST DATA.

4. Performance Improvement:

A. Train another SVM on the components out of PCA. Tune the parameters to improve performance.

* WE HAVE RUN THE SVM MODEL ON DIFFERENT SELECTIONS OF PRINCIPAL COMPONENTS.

* AS THE NUMBER OF COMPONENTS INCREASES, WE CAN SEE THAT MODEL PERFORMANCE ALSO INCREASES.

* AT 13 COMPONENTS, WE CAN SEE THAT THE MODEL ACHIEVES MORE THAN 90% ACCURACY.

* WHEN ALL COMPONENTS ARE SELECTED, THE MODEL REACHES ITS MAXIMUM ACCURACY OF 95.77%

* AT 15 COMPONENTS THE MODEL ACCURACY IS 94.59%, AND AFTER THAT WE SEE ONLY A MARGINAL INCREASE IN ACCURACY.

* FROM 5 COMPONENTS ONWARDS, THE MODEL STARTED PERFORMING WELL.

* IN THIS ASSESSMENT, TO KEEP THE MODEL PERFORMANCE AT 90%, I AM CONSIDERING 13 COMPONENTS FOR FURTHER
  SVM TRAINING AND PARAMETER TUNING
* WE CAN USE EITHER GRIDSEARCHCV OR RANDOMIZEDSEARCHCV TO FIND THE BEST HYPERPARAMETERS.
* SINCE WE ARE USING RANDOMIZEDSEARCHCV, DO REMEMBER THAT:
    * EVERY TIME WE RUN THE ABOVE CODE, EXPECT TO GET A NEW SET OF BEST HYPERPARAMETERS.
    * TO REDUCE COMPUTATION TIME AND THE LOAD ON SYSTEM RESOURCES, I PREFERRED THIS METHOD
      OVER GRID SEARCH, SINCE GRID SEARCH TAKES TOO MUCH TIME TO RUN.
* WE RE-TRAINED THE SVM MODEL WITH THE BEST HYPERPARAMETERS AND THE RESULTS WERE PUBLISHED.
* FROM THE ABOVE OUTPUT WE CAN DRAW THE FINDINGS BELOW:
    * MODEL PERFORMANCE BEFORE PCA WITH THE BEST PARAMETERS IS AT A MAXIMUM OF 100%
    * ON REDUCING DIMENSIONS TO 6, THE MODEL STILL PERFORMED WELL AT 93.5% ACCURACY
    * WE CAN SEE THAT AT 13 DIMENSIONS, THE MODEL'S TRAINING DATA ACCURACY IS ALMOST 100%
* DO NOTE THAT FOR EVERY SET OF PARAMETERS OUTPUT BY RANDOMIZED SEARCH CV, WE GET VARIATION IN THE RESULTS
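A small RandomizedSearchCV sketch over the same knobs (kernel, C, gamma). The data is synthetic and n_iter is kept tiny, so the best parameters found here will differ from the RBF / gamma 0.1 / C 17 combination reported in the next section; as noted above, the result also varies run to run unless random_state is fixed.

```python
from scipy.stats import uniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Synthetic 3-class stand-in for the PCA-reduced vehicle data.
X, y = make_classification(n_samples=200, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)

# Sample kernel/C/gamma combinations instead of exhaustively gridding them.
param_dist = {"kernel": ["linear", "rbf"],
              "C": uniform(0.1, 20),
              "gamma": uniform(0.01, 1)}
search = RandomizedSearchCV(SVC(), param_dist, n_iter=10, cv=3, random_state=42)
search.fit(X, y)
print(search.best_params_)
```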

B. Share best Parameters observed from above step.

* WE USED RANDOMIZEDSEARCHCV TO FIND THE BEST PARAMETERS, WHICH ARE AS BELOW:

    KERNEL: RBF
    GAMMA: 0.1
        C: 17

    ALSO, WE CAN SEE THAT ON REDUCING TO 6 DIMENSIONS WE STILL GET GOOD MODEL PERFORMANCE

    ACCURACY IS REDUCED ON REDUCING THE DIMENSIONS, BUT WE STILL HAVE AN ACCURACY OF MORE THAN 90%

C. Print Classification metrics for train data of above model and share relative improvement in performance in all the models along with insights.

* LOOKING AT THE ABOVE SCORES, WE CAN SEE THAT THE MODEL HAS MORE THAN 90% ACCURACY IN ALL 3 SCENARIOS.

* ON THE RAW DATA, THE TRAINING ACCURACY IS HIGH AT 98%

* WE HAVE REDUCED THE DIMENSIONS TO 6 AND 13 USING PCA.

* AT 6 DIMENSIONS, THE ACCURACY IS STILL 92%, SHOWING THAT THE ORIGINAL FEATURES ARE STRONGLY CORRELATED

* AT 13 DIMENSIONS, THE TRAINING ACCURACY IS 94%, SHOWING THAT THE ADDITIONAL COMPONENTS CAPTURE MORE OF THE VARIANCE.

* OVERALL, AFTER PCA DIMENSIONALITY REDUCTION, THE MODEL IS STILL PERFORMING VERY WELL.

5. Data Understanding & Cleaning:

A. Explain pre-requisite/assumptions of PCA.

WHAT IS PCA?
************

* PCA IS AN UNSUPERVISED LEARNING TECHNIQUE THAT HELPS IN REDUCING THE DIMENSIONALITY OF THE INPUT DATA SET.

* WHEN A DATA SET HAS HIGH DIMENSIONALITY, PCA DERIVES NEW VARIABLES (PRINCIPAL COMPONENTS) AS LINEAR
  COMBINATIONS OF THE ORIGINAL VARIABLES BY IDENTIFYING THE RELATIONSHIPS BETWEEN THEM.

* THE FIRST PRINCIPAL COMPONENT EXPLAINS THE MOST VARIATION, AND THE EXPLAINED VARIATION DECREASES AS
  WE MOVE DOWN TO THE LAST PRINCIPAL COMPONENTS, WHICH HAVE THE LEAST VARIATION.

* AS WE REDUCE THE DIMENSIONS, WE MAY SEE A DECREASE IN THE ACCURACY OF THE MODEL.

* THIS DIMENSIONALITY REDUCTION ALSO ALLOWS US TO WORK WITH FEWER COLUMNS AND VISUALIZE THEM EASILY

* THIS MAKES THE MACHINE LEARNING MODEL EASIER TO ANALYSE AND FASTER TO PROCESS.

* IN ORDER TO ACHIEVE THIS, WE NEED TO MAKE SOME ASSUMPTIONS ABOUT THE DATA SET, WHICH ARE DISCUSSED BELOW.

PRE-REQUISITE / ASSUMPTIONS OF PCA:
***********************************

THERE ARE SOME ASSUMPTIONS TO BE MADE THAT MAKES THE DIMENSIONALITY REDUCTION EASIER:

1. THE DATA SET IS LINEAR: THE VARIABLES IN THE DATA SET HAVE LINEAR RELATIONSHIPS AMONG
   THEMSELVES.

2. THE INDEPENDENT VARIABLES IN THE DATA SET ARE HIGHLY CORRELATED WITH EACH OTHER, SO A REDUCED FEATURE SET
   CAN REPRESENT THE ORIGINAL DATA SET IN AN EFFECTIVE MANNER.

3. THE DATA SET CONTAINS VERY FEW OUTLIERS. HAVING MORE OUTLIERS, WHICH DEVIATE FROM MOST OF THE DATA POINTS,
   MEANS MORE ERRORS AND REDUCES OVERALL MODEL PERFORMANCE.

4. ALL THE FEATURES ARE NUMERIC IN NATURE.

5. PRINCIPAL COMPONENTS WITH HIGHER VARIANCE ARE GIVEN THE UTMOST IMPORTANCE, WHEREAS PRINCIPAL COMPONENTS WITH
   LOWER VARIANCE ARE TREATED AS NOISE.

B. Explain advantages and limitations of PCA.

ADVANTAGES OF PCA:
******************

* DIMENSIONALITY REDUCTION MAKES VISUALIZING THE DATA SET EASIER, SINCE THERE ARE FEWER FEATURES TO PLOT

* HELPS IN FINDING THE MOST IMPORTANT FEATURES, OR VARIABLES THAT ARE UNCORRELATED

* INCREASES THE SPEED OF THE MACHINE LEARNING ALGORITHM, AS THERE ARE FEWER FEATURES TO ANALYZE. OVERALL, PCA
  INCREASES ALGORITHM PERFORMANCE.

* REDUCES OVERFITTING OF THE DATA. MORE VARIABLES CAN CAUSE OVERFITTING, AND AS PCA REDUCES DIMENSIONALITY
  WE HAVE FEWER, MORE IMPORTANT FEATURES, WHICH REDUCES OVERFITTING.

* SIMPLIFIES COMPLEX BUSINESS PROBLEMS, AS WITH PCA WE NEED TO TRAIN THE MODEL ONLY ON THE PRINCIPAL COMPONENTS,
  WHICH REDUCES THE NUMBER OF VARIABLES THAT NEED TO BE ANALYSED.

LIMITATIONS OF PCA:
*******************

* PCA GIVES PRIORITY TO THE DIRECTIONS WITH THE MOST VARIATION IN THE DATA SET, SO WE LOSE
  SOME FEATURES WHEN TRAINING THE MODEL USING PCA, THUS LEADING TO INFORMATION LOSS.

* BEFORE APPLYING PCA, WE MUST STANDARDISE THE DATA SET; IF WE DO NOT, IT BECOMES DIFFICULT
  FOR PCA TO FIND THE IMPORTANT FEATURES.

* THE STANDARDISATION HAS TO BE DONE ON ALL THE FEATURES BEFORE APPLYING PCA, EVEN THOUGH AT A LATER STAGE WE TEND
  TO USE ONLY THE COMPONENTS THAT CAPTURE HIGH VARIATION IN THE DATA.

* SINCE WE ARE STANDARDISING THE DATA AND USING THE PRINCIPAL COMPONENTS, WE CANNOT DIRECTLY INTERPRET THE ORIGINAL
  VARIABLES IN THE DATA SET. ALSO NOTE THAT THESE PRINCIPAL COMPONENTS ARE NOT AS READABLE AS THE ORIGINAL DATA

* WE ASSUME THE DATA SET TO BE LINEAR, SO PCA IS NOT SUITABLE FOR CAPTURING NON-LINEAR STRUCTURE.

* HAVING MORE OUTLIERS IN THE DATA CAN MAKE IT DIFFICULT FOR PCA TO IDENTIFY THE PRINCIPAL COMPONENTS.
                   ********** END OF UNSUPERVISED LEARNING ASSESSMENT **********